An R-package to facilitate high-throughput ecological inferences of time-series RNAseq data.

Introduction:

Micrbiobial Ecologists often use genomic content to infer the biogeochemical processes and interactions within a microbial community. When organisms overlap in their genomic content, they are often inferred to possess similar traits. Conversely, divergent genomic content is indicative of niche differentiation.

Transcriptomics data may aid in making these ecological inferences by providing information on the regulation of this genetic content. Congruent transcriptional responses (CTRs) of particular metabolic modules in disparate organisms within a community may be indicative of common metabolic features shared by many organisms. Conversely, when divergent transcriptional responses are observed across numerous lineages, this is indicative of the disparate niches in the community.

In this package we present a statistical framework to identify CTRs of (pre-defined) metabolic modules in microbial communties. In this manner, genomic bins may be clustered into sub-networks that share responses for particular modules, thereby facilitating inferences on functional redundancy (e.g. intra-cluster), niche partitioning (inter-cluster), and the emergence of complex traits (integrating infromation about multiple modules).

To demonstrate the utility of this approach, we identify CTRs in polymer storage (polyphosphate, glycogen, and polyhydroxyalkanoates) and amino acid biosynthesis modules of 38 genome bins recovered from a bioreactor operated under conditions that select for organisms capable of storing polymers.

A) Importing Raw Data

  • Structure (Tab deliminated)
    • Locus Tag, Raw Counts, Annotations
  • Import KEGG Orthology Table
  • Statistics for normalize: Total reads per sample, total reads mapped per genome, log(2,RPKM)

B) Define Functions

Function Name Description
1 RNAseq_Normalize A function used to normalize raw counts based on various inputs.
2 Create_Rank_Columns Calculate the ranks of each transcript
3 which_rows_with_no_sd Identify rows with a SD of 0
4 Calc_Norm_Euc Calculates the Euclidean Distance between the normalized ranks
5 Calc_Jaccard Calculates the Jaccard Distance between to vectors (Presence/Absence)
6 Presence_Absence_Matrix Calculates P/A matrix for all KOs represented >N times in the dataset
7 Jaccard_Distance_Function Calculates all pairwise Jaccard Distances for a module
8 Individual_KOs_Background Calculates the background distribution for pairwise comparisons of transcripts across bins
9 Generate_Random_Module Generates a random module of size N
10 Background_Distribution_Modules Calculates a distribution for a random module of size N
11 NRED_Distance_Function Calculating normalized rank euclidean distances between modules
12 P_Distance_Function Calculating Pearson Correlations between modules
13 P_NRED_Distance_Function Calculating composite distances between modules
14 Cor_Matrix Output Distance Matrix for both Pearson and Euclidean Distances (implemented in Function F13)
15 ave_Z_score_Func Convert array into matrix of average scores (may be used on outputs for F11-F13)
16 cluster_func Define clusters using the louvain algorithm
17 array_fig Make figure from array

C) Calculate background distributions for individual KOs and modules of ‘N’ KOs

D) Calculate Distances

  • All pairwise Pearson Correlations (PC), Normalized Rank Euclidean Distances (NRED) for each KOn in each module, across all bins (N*#genomes^2)
  • Jaccard Distances between Genome A and B (\(J_{AB}\))
  • Calculate composite score (CS)
    \[\begin{align*} CS_{AB} &=Average(KO_{1},\cdots,KO_{n})/J_{AB} \end{align*}\]

The number of calculations is equal to #genomes2*#KOs

E) Calculating statsitics for each module

The distribution of pairwise distances within each cluster may be compared to a background distribution to calculate a p-value. A small p-value indicates that a strong intra-cluster CTR.

In a similar manner, the KOs forming each module (each KO), may be organized based on the distribution of pairwise distances.

Section A) Importing Raw Data & Data structure

First the necessary librarys are imported, and the random seed is set so that the results are reproducible.

set.seed(1396558)
library("plyr")
library("reshape")
## 
## Attaching package: 'reshape'
## The following objects are masked from 'package:plyr':
## 
##     rename, round_any
library("igraph")
## 
## Attaching package: 'igraph'
## The following objects are masked from 'package:stats':
## 
##     decompose, spectrum
## The following object is masked from 'package:base':
## 
##     union
library('knitr')
library("hexbin")
library("RColorBrewer")

Next, the necessary files are uploaded including

KO_pathways_table<-read.table("~/Documents/Research/KO/ko00001.keg_tab_delimited",sep="\t",quote = "", row.names = NULL, stringsAsFactors = FALSE)

RNAseq_Annotated_Matrix<-read.table("~/Documents/Research/Ecology_of_BGC/EBPR_RNAseq_Annotation_Matrix",sep="\t",quote = "", row.names = NULL, stringsAsFactors = FALSE, fill = TRUE)

Modules are defined based on their KEGG Orthology (KO) annotations

All_KOs<-names(which(table(RNAseq_Annotated_Matrix$KO)>=5))[-1]
PHA_module <- c("K01895","K00925","K00625","K00626","K00023", "K03821")
polyP_module <- c("K00937","K02040","K02037","K02038","K02036","K07636","K07657","K03306")
Glycogen_module<- c("K00700","K00688","K00703","K01214","K15778","K00975","K00845")

In this example, the matrix includes all the bins identified. However, because this method is sensitive to incompleteness we will filter all genomes <80 % complete and with >5% contamination. It is also possible to use reference genomes.

# This code is only necessary to modify the original matrix
# These bins were identified through visual inspection of the relative abundances 
Clade_IIA_bins<-c(39,61,71,92,96,99)

# add a column with the bin number ($Bin). This one-liner parses out the Bin number from the Locus ID
# This is unnecessary if y
RNAseq_Annotated_Matrix$Bin<-gsub(".*\\.(.*)\\..*", "\\1", RNAseq_Annotated_Matrix[,1])

# Save the full matrix under another name because the subsequent step will filter out all Bins that are not >80% complete <5% contaminated
RNAseq_Annotated_Matrix_full<-RNAseq_Annotated_Matrix

# Merge Clade IIA bins 
RNAseq_Annotated_Matrix$Bin[which(RNAseq_Annotated_Matrix$Bin %in% Clade_IIA_bins)]<-39

# These are genomes that are  >80% complete <5% contaminated, define a subset matrix with only these genomes
high_quality_bins<-c(8,28,25,7,46,39,22,38,54,53,48,45,31,42,16,33,26,40,36,21,27,17,19,32,14,11,30,43,35,29,23,58,41,20,15,37,49,50)
RNAseq_Annotated_Matrix<-RNAseq_Annotated_Matrix[which(RNAseq_Annotated_Matrix$Bin %in% high_quality_bins),]

If the data is already in the correct format and contains only high-quality bins it should look like this:

Section B) Defining functions

Section C) Normalize and calculate background distributions for individual KOs and modules of ‘N’ KOs

Data will be normalized by the depth of sequencing and the number of reads mapped per genome. It will then be converted to \(log_{2}\) scale

Define inputs

no_feature<- c(9159700,4459877,9826273,8171512,9542765,10522313)
ambiguous<- c(3940698,2023389,4675033,3308789,6446272,5966543)
not_aligned<- c(0,19317660,0,0,0,0)

Normalize

RNAseq_Annotated_Matrix<-RNAseq_Normalize(RNAseq_Annotated_Matrix,no_feature,ambiguous,not_aligned)
kable(RNAseq_Annotated_Matrix[1:5,], caption = "Normalized Data")
Normalized Data
Locus_ID Sample1 Sample2 Sample3 Sample4 Sample5 Sample6 KO Bin
104849 EBPR_Metabat_bin.7.fa_00001 4.446677 4.538381 5.048706 3.527324 4.201460 5.601435 K03496 7
104850 EBPR_Metabat_bin.7.fa_00002 5.016134 3.387445 4.386040 2.726439 5.374982 4.727191 7
104851 EBPR_Metabat_bin.7.fa_00003 3.737717 2.967767 4.982933 3.396566 2.584963 5.587142 7
104852 EBPR_Metabat_bin.7.fa_00004 0.000000 3.677773 6.873607 5.507952 4.658550 4.846946 K01007 7
104853 EBPR_Metabat_bin.7.fa_00005 3.742579 5.806160 5.895334 4.999853 5.140550 4.347732 7

Calculate Ranks

RNAseq_Annotated_Matrix<-Create_Rank_Columns(RNAseq_Annotated_Matrix)
kable(RNAseq_Annotated_Matrix[1:5,], caption = "Data with Rank Column added")
Data with Rank Column added
Locus_ID Sample1 Sample2 Sample3 Sample4 Sample5 Sample6 KO Bin Rank1 Rank2 Rank3 Rank4 Rank5 Rank6
104849 EBPR_Metabat_bin.7.fa_00001 4.446677 4.538381 5.048706 3.527324 4.201460 5.601435 K03496 7 0.4931973 0.4115646 0.5700976 0.6545401 0.6978705 0.5229222
104850 EBPR_Metabat_bin.7.fa_00002 5.016134 3.387445 4.386040 2.726439 5.374982 4.727191 7 0.3986986 0.6118012 0.6657794 0.7388347 0.5198166 0.6755398
104851 EBPR_Metabat_bin.7.fa_00003 3.737717 2.967767 4.982933 3.396566 2.584963 5.587142 7 0.5839988 0.6610470 0.5797101 0.6739130 0.8450163 0.5246968
104852 EBPR_Metabat_bin.7.fa_00004 0.000000 3.677773 6.873607 5.507952 4.658550 4.846946 K01007 7 0.9322686 0.5446613 0.2796510 0.3642413 0.6342798 0.6552795
104853 EBPR_Metabat_bin.7.fa_00005 3.742579 5.806160 5.895334 4.999853 5.140550 4.347732 7 0.5767524 0.2136942 0.4318249 0.4380361 0.5579710 0.7215321

Define some features of the matrix

SS is the column which indicates the start of samples, SE is the sample end. Same for the ranks (RS and RE).

Bin_Column<-which(colnames(RNAseq_Annotated_Matrix)=="Bin")
SS<-2
SE<-length(sample_names)+1
RS<-Bin_Column+1
RE<-Bin_Column+length(sample_names)

How many rows have a SD of zero?

RNAseq_Annotation_Matrix_no_sd_of_zero<-which_rows_with_no_sd(RNAseq_Annotated_Matrix)

Of 138365 rows (transcripts), 1529 had a standard deviation of 0. These are not included when calculating background distibutions.

Here we calculate the background distribution of the individual KOs, and test whether there is a significant difference between a random pairing of genes in two bins, and a random pairing of genes with same function in two bins. When multiple genes with a given function are present in a genome, two alternative strategies are taken whereby either 1) a random pairwise distance is used or 2) the minimum distance is used.

All_KOs<-names(which(table(RNAseq_Annotated_Matrix$KO)>=5))[-1]

I_KOs_Background <- Individual_KOs_Background(RNAseq_Annotation_Matrix_no_sd_of_zero,10000)

t.test_KO_random_pearson<-t.test(I_KOs_Background$KO_pairwise_gene_correlation,I_KOs_Background$random_pairwise_gene_correlation, alternative="greater") # x > y (NULL)
t.test_H_KO_H_random_pearson<-t.test(I_KOs_Background$H_KO_pairwise_gene_correlation,I_KOs_Background$H_random_pairwise_gene_correlation, alternative="greater")
t.test_H_KO_KO_pearson<-t.test(I_KOs_Background$H_KO_pairwise_gene_correlation,I_KOs_Background$KO_pairwise_gene_correlation, alternative="greater")

t.test_KO_random_euclidean<-t.test(I_KOs_Background$KO_pairwise_gene_euclidean,I_KOs_Background$random_pairwise_gene_euclidean, alternative="less") # x > y (NULL)
t.test_H_KO_H_random_euclidean<-t.test(I_KOs_Background$H_KO_pairwise_gene_euclidean,I_KOs_Background$H_random_pairwise_gene_euclidean, alternative="less")
t.test_H_KO_KO_euclidean<-t.test(I_KOs_Background$H_KO_pairwise_gene_euclidean,I_KOs_Background$KO_pairwise_gene_euclidean, alternative="less")

t.test_KO_random_pearson$p.value

[1] 9.394149e-07

t.test_H_KO_H_random_pearson$p.value

[1] 9.542533e-07

# t.test_H_KO_KO_pearson # This is for comparing the two KO distributions

t.test_KO_random_euclidean$p.value

[1] 1.049752e-224

t.test_H_KO_H_random_euclidean$p.value

[1] 6.989331e-235

# t.test_H_KO_KO_euclidean # This is for comparing the two KO distributions

par(mfrow=c(2,2),mar=c(3,3,3,1))
# plot 1
plot(density(I_KOs_Background$random_pairwise_gene_correlation,adjust = 2,na.rm=TRUE),ylim=c(0,1),xlab="",ylab="",main="")
points(density(I_KOs_Background$KO_pairwise_gene_correlation,adjust = 2),typ="l",col="blue")
mtext(paste("p-value = ",signif(t.test_KO_random_pearson$p.value,2)),side=3,col="blue",padj=2,cex=.75)
title(ylab="Density", line=2, cex.lab=1)
title(xlab="PC", line=2, cex.lab=1)

# plot 2
plot(density(I_KOs_Background$H_random_pairwise_gene_correlation,adjust = 2),ylim=c(0,1),xlab="",ylab="",main=" ")
points(density(I_KOs_Background$H_KO_pairwise_gene_correlation,adjust = 2),typ="l",col="red")
mtext(paste("p-value = ",signif(t.test_H_KO_H_random_pearson$p.value,2)),side=3,col="red",padj=2,cex=.75)
title(ylab="Density", line=2, cex.lab=1)
title(xlab="PC", line=2, cex.lab=1)

# plot 3
plot(density(I_KOs_Background$random_pairwise_gene_euclidean,adjust = 2),typ="l" ,ylim=c(0,1.25),xlab="",ylab="",main="")
points(density(I_KOs_Background$KO_pairwise_gene_euclidean,adjust = 2),typ="l",col="blue")
title(ylab="Density", line=2, cex.lab=1)
title(xlab="NRED", line=2, cex.lab=1)
mtext(paste("p-value = ",signif(t.test_KO_random_euclidean$p.value,2)),side=3,col="blue",padj=2,cex=.75)

# plot 4
plot(density(I_KOs_Background$H_random_pairwise_gene_euclidean,adjust = 2),typ="l" ,ylim=c(0,1.25),xlab="",ylab="",main="")
points(density(I_KOs_Background$H_KO_pairwise_gene_euclidean,adjust = 2),typ="l",col="red")
title(ylab="Density", line=2, cex.lab=1)
title(xlab="NRED", line=2, cex.lab=1)
title(" \n\nComparison of random & functional \n pairwise comparisons", outer=TRUE) 
mtext(paste("p-value = ",signif(t.test_H_KO_H_random_euclidean$p.value,2)),side=3,col="red",padj=2,cex=.75)

# What would a background distribution look like if there were no ties (e.g. genes with the same counts)?
# random_euclidean_distances<-rep(NA,10000)
# for (i in 1:10000) {random_euclidean_distances[i]<-Calc_Norm_Euc(sample(seq(.001,1,.001),6),sample(seq(.001,1,.001),6))}

Based on these background statistics, means & standard deviations for calculating Z scores may be calcualted. Here we used the random background distribution (e.g. not necessarily between genes with the same KO annotations).

mu_pearson<-mean(I_KOs_Background$random_pairwise_gene_correlation)[1]
sd_pearson<-sd(I_KOs_Background$random_pairwise_gene_correlation)[1]
# A KO distribution
mu_KO_pearson<-mean(I_KOs_Background$KO_pairwise_gene_correlation)[1]
sd_KO_pearson<-sd(I_KOs_Background$KO_pairwise_gene_correlation)[1]
# A maximized background KO distribution
mu_H_KO_pearson<-mean(I_KOs_Background$H_KO_pairwise_gene_correlation)[1]
sd_H_KO_pearson<-sd(I_KOs_Background$H_KO_pairwise_gene_correlation)[1]

# means & standard deviations for calculating Z scores (NRED)
# A completely random background distribution
mu_euclidean<-mean(I_KOs_Background$random_pairwise_gene_euclidean)[1]
sd_euclidean<-sd(I_KOs_Background$random_pairwise_gene_euclidean)[1]
# A KO distribution
mu_KO_euclidean<-mean(I_KOs_Background$KO_pairwise_gene_euclidean)[1]
sd_KO_euclidean<-sd(I_KOs_Background$KO_pairwise_gene_euclidean)[1]
# A maximized background KO distribution
mu_H_KO_euclidean<-mean(I_KOs_Background$H_KO_pairwise_gene_euclidean)[1]
sd_H_KO_euclidean<-sd(I_KOs_Background$H_KO_pairwise_gene_euclidean)[1]

Z_random_pairwise_gene_correlation<-((I_KOs_Background$random_pairwise_gene_correlation)-mu_pearson)/sd_pearson
Z_random_pairwise_gene_euclidean<-((I_KOs_Background$random_pairwise_gene_euclidean)-mu_euclidean)/sd_euclidean

Z_H_KO_pairwise_gene_correlation<-((I_KOs_Background$H_KO_pairwise_gene_correlation)-mu_pearson)/sd_pearson
Z_H_KO_pairwise_gene_euclidean<-((I_KOs_Background$H_KO_pairwise_gene_euclidean)-mu_euclidean)/sd_euclidean

# df_random<-as.data.frame(cbind((Z_random_pairwise_gene_correlation),(Z_random_pairwise_gene_euclidean)))
# df_KO<-as.data.frame(cbind((KO_pairwise_gene_correlation),(KO_pairwise_gene_euclidean)))
# df_H_KO<-as.data.frame(cbind((Z_H_KO_pairwise_gene_euclidean),(H_KO_pairwise_gene_euclidean)))

Z_df_random<-as.data.frame(cbind(-(Z_random_pairwise_gene_correlation),(Z_random_pairwise_gene_euclidean)))
Z_df_H_KO<-as.data.frame(cbind(-(Z_H_KO_pairwise_gene_correlation),(Z_H_KO_pairwise_gene_euclidean)))

These Z-scores may be combined into a composite score.

rf <- colorRampPalette(rev(brewer.pal(11,'Spectral')))
plot(hexbin(Z_df_random), colramp=rf, mincnt=1, maxcnt=75,ylab="NRED",xlab="PC")

plot(hexbin(Z_df_H_KO), colramp=rf, mincnt=1, maxcnt=75,ylab="NRED",xlab="PC")

Section D) Calculate background distributions for modules of ‘N’ KOs

Section 5) Calculating statsitics for each module

Statistics for the PHA module

Statistics for the Glycogen module

Statistics for the polyP module